LLM to HPC Tutorial
How to Download and Set Up Meta LLaMA 3 Models on HPC (Adroit)
This guide walks you through configuring your environment, authenticating access, and downloading the LLaMA 3 models from Hugging Face on the Adroit HPC system. It assumes basic familiarity with the command line, Python, and HPC usage.
This process is much easier if you have connected VS Code to the Adroit cluster. Here is some info on that. IT also runs help sessions on this, which are very useful.
Step 1: Configure the Hugging Face Cache Directory
By default, Hugging Face stores downloaded models in your home directory (e.g., /home/username/.cache/huggingface), which may have limited storage on HPC systems. Redirect the cache to your scratch directory, which has much more space. You can check how much space is available by running the checkquota command in the terminal.
- Set the environment variable HF_HOME to your scratch directory:
On the Adroit login node, run this command to append the export statement to your .bashrc:
echo "export HF_HOME=/scratch/network/$USER/.cache/huggingface/" >> $HOME/.bashrc
- Reload your shell configuration:
source ~/.bashrc
- Verify the variable is set:
echo $HF_HOME
The expected output should be:
/scratch/network/<YourNetID>/.cache/huggingface/
Step 2: Get Authentication Access from Meta (Required for LLaMA Models)
Meta requires users to accept a license and be granted explicit access to the LLaMA 3 models on Hugging Face. This means you’ll need to sign up for a Hugging Face account and request access to the LLaMA 3 models.
Go to the LLaMA 3 model page on Hugging Face: https://huggingface.co/meta-llama/Llama-3.1-8B (or whatever model you want access to)
Log in or create a Hugging Face account if you haven’t already.
Accept the model license terms: Click the “Access repository” button and agree to the license to request access.
Wait for access to be granted. This should be relatively quick, but may take a few minutes to a few hours depending on demand.
Step 3: Log In to Hugging Face CLI on HPC
Once access is granted, authenticate your HPC environment to allow downloading protected models.
- Log in to Hugging Face CLI:
On the Adroit login node, run the following command:
huggingface-cli login
- Enter your Hugging Face token:
You will be prompted for your Hugging Face access token, which you can find in your Hugging Face account settings under “Access Tokens”. Copy and paste it into the terminal. Do not share this token with anyone; it is a personal access token that grants access to your account and to gated models.
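Once the token is accepted, you can quickly confirm that the CLI is authenticated (this should print your Hugging Face username):
huggingface-cli whoami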
Step 4: Download the LLaMA 3 Model on the Login Node
Now that you have authenticated, you can download the LLaMA 3 model to your scratch directory.
- Create a Python script download_llama3.py with this content:
Replace meta-llama/Llama-3.1-8B with the specific model you want to download if different, and make sure transformers is installed in your Python environment (see the install sketch just after the script if you need to set that up).
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "meta-llama/Llama-3.1-8B"
cache_path = "/scratch/network/sm9518/.cache/huggingface" # replace with your actual NetID
# Download model and tokenizer to cache
AutoTokenizer.from_pretrained(model_id, cache_dir=cache_path)
AutoModelForCausalLM.from_pretrained(model_id, cache_dir=cache_path)
print(f"{model_id} Downloaded Successfully! to {cache_path}")
- Run the script on the login node:
python download_llama3.py
- This will download all necessary model files into your scratch cache directory set by HF_HOME.
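To confirm the files landed in scratch, you can list the cache from the login node (the snapshot hash in your cache may differ from the one shown in the next step):
ls /scratch/network/$USER/.cache/huggingface/models--meta-llama--Llama-3.1-8B/snapshots/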
Step 5: Test the Downloaded Model
Now that you have downloaded the LLaMA 3 model to your scratch directory, you can run inference on an HPC compute node.
5a. Create a Python script for testing (run_test_llama.py)
Save the following code to /scratch/network/$USER/python_test/run_test_llama.py. This script loads the model and runs a short text generation example using the transformers pipeline API.
from transformers import pipeline
import torch
import os

print(f"CUDA Available: {torch.cuda.is_available()}")
if not torch.cuda.is_available():
    raise ValueError(
        "CUDA is not available. Make sure you are running this on a GPU node. "
        "For example, run with Slurm requesting a GPU:\n\n"
        "\tsalloc -t 0:10:00 --ntasks=1 --gres=gpu:1 python run_test_llama.py"
    )

# Path to the downloaded snapshot in the scratch cache (adjust if your snapshot hash differs)
model_path = "/scratch/network/$USER/.cache/huggingface/models--meta-llama--Llama-3.1-8B/snapshots/d04e592bb4f6aa9cfee91e2e20afa771667e1d4b"

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))

# Load the model onto the GPU in bfloat16 to keep memory use manageable
pipe = pipeline(
    "text-generation",
    model=model_path,
    tokenizer=model_path,
    torch_dtype=torch.bfloat16,
    device=0,
)

prompt = "You are an expert psychologist. Tell me something interesting about psychology regarding Erik Nook's research:"
output = pipe(prompt, max_new_tokens=50)
print("\nModel output:\n", output)
Make sure to replace $USER in the path with your actual NetID ($USER is not expanded inside a Python string literal), or build the path programmatically, e.g., with os.environ["USER"].
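As an aside, because HF_HOME already points at the scratch cache, you can also load the model by its Hugging Face id instead of the full snapshot path. Setting HF_HUB_OFFLINE=1 before importing transformers tells it to resolve everything from the local cache and skip network lookups, which is useful on compute nodes without internet access. A minimal sketch (not the script used above):
import os
os.environ["HF_HUB_OFFLINE"] = "1"  # resolve from the local cache only; no network calls

import torch
from transformers import pipeline

# With HF_HOME pointing at the scratch cache, the model id resolves to the cached files.
pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B",
    torch_dtype=torch.bfloat16,
    device=0,
)
print(pipe("Hello from Adroit:", max_new_tokens=20)[0]["generated_text"])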
5b. Create a Slurm job script (llama3-textgen.slurm)
This Slurm batch script requests one A100 GPU on the Adroit cluster and runs the above Python test script.
#!/bin/bash
#SBATCH --job-name=llama3-textgen # Job name
#SBATCH --nodes=1 # Use one node
#SBATCH --ntasks=1 # One task
#SBATCH --cpus-per-task=1 # One CPU core
#SBATCH --mem=36G # Memory request (can be less)
#SBATCH --gres=gpu:1 # Request 1 GPU
#SBATCH --time=00:10:00 # Max runtime (adjust as needed)
#SBATCH --constraint=a100 # Use A100 GPU
#SBATCH --nodelist=adroit-h11g1 # Pin to a specific GPU node (optional; check availability first, see note below)
#SBATCH --mail-type=ALL # Email on start, end, fail
#SBATCH --mail-user=sm9518@princeton.edu # Your email
module purge
module load anaconda3/2024.6
module load cudatoolkit/11.8
source activate talkspaceMADS # activate your conda environment (replace with your own env name)
cd /scratch/network/$USER/python_test
echo "Job started at $(date)"
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python run_test_llama.py
echo "Job completed at $(date)"
💡 Check GPU Node Usage Before Selecting a Node or GPU
Before submitting your job or manually specifying a GPU node (e.g., with #SBATCH --nodelist=adroit-h11g1), it’s a good idea to check which nodes and GPUs have free memory or are under low load. Otherwise, your job might be assigned to a GPU that is already fully used, causing CUDA out-of-memory errors.
On Adroit, you can use commands like these from the login node to check GPU availability:
# Show GPU status and free GPUs per node
shownodes -p gpu
If you don’t specify a node, Slurm will pick one for you, but it might not always be the best choice if GPUs on that node are busy.
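Once you’ve picked a node (or decided to let Slurm choose for you), submit the batch script from the login node and monitor it:
sbatch llama3-textgen.slurm
squeue -u $USER
By default, the job’s output (including the generated text) is written to a slurm-<jobid>.out file in the directory you submitted from.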